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Abstract 

In this paper, we consider the problem of manifold approximation with affine subspaces. Our objective 
is to discover a set of low dimensional affine subspaces that represents manifold data accurately while 
preserving the manifold's structure. For this purpose, we employ a greedy technique that partitions 
manifold samples into groups that can be each approximated by a low dimensional subspace. We start 
by considering each manifold sample as a different group and we use the difference of tangents to 
determine appropriate group mergings. We repeat this procedure until we reach the desired number of 
sample groups. The best low dimensional affine subspaces corresponding to the final groups constitute our 
approximate manifold representation. Our experiments verify the effectiveness of the proposed scheme 
and show its superior performance compared to state-of-the-art methods for manifold approximation. 

Index Terms 

manifold approximation, tangent space, affine subspaces, flats. 

I. Introduction 

The curse of dimensionality is one of the most fundamental issues that researchers, across various 
data processing disciplines, have to face. High dimensional data that is difficult to even store or transmit, 
huge parametric spaces that are challenging to exploit and complex models that are difficult to learn and 
prone to over-fitting, are some simple evidences of the dimensionality problem. However, it is not rare 
that the data presents some underlying structure, which can lead to more efficient data representation and 
analysis if modeled properly. 

S. Karygianni and P. Frossard are with Ecole Polytechnique Federale de Lausanne (EPFL), Signal Processing Laboratory-LTS4, 
CH-1015, Lausanne, Switzerland (e-mail: {sofia.karygianni, pascal.frossard}@epfl.ch). 
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In many cases, the underlying structure of the signals of a given family can be described adequately 
by a manifold model that has a smaller dimensionality than the signal space. Prominent examples are 
signals that are related by transformations, like images captured under different viewpoints in a 3D 
scene, or signals that represent different observations of the same physical phenomenon like the EEG 
and the ECG. Manifolds have already been adopted with success in many different applications like 
transformation-invariant classification, recognition and dimensionality reduction Q, |[2]|, O. 

In general, manifolds are topological spaces that locally resemble a Euclidean space. Therefore, 
although as a whole they might be extremely complicated structures, locally, i.e., in the neighborhood of 
a point, they have the same characteristics as the usual Euclidean space. In this work, we are going to 
consider d-dimensional, differentiable manifolds that are embedded into a higher dimensional Euclidean 
space, R^^N » d. Intuitively, one can think of a d-dimensional manifold embedded into as the 
generalization of a surface in N dimensions: it is a set of points that locally seem to live in but that 
macroscopically synthesize a structure living into R^. For example, a sphere in R^ and a circle in R^ 
are both manifolds of dimension 2 and 1 respectively. Although manifolds are appealing for effective 
data representation, their unknown and usually strongly non-linear structure makes their manipulation 
quite challenging. There are cases where an analytical model can represent the manifold, like a model 
built on linear combinations of atoms coming from a predefined dictionary [4]. However, an analytical 
model is unfortunately not always available. A workaround consists in trying to infer a global, data- 
driven parametrization scheme for the manifold by mapping the manifold data from the original space to 
a low-dimensional parametric space. The problem of unveiling such a parametrization is called manifold 
learning. 

Usually, it is hard to discover a universal manifold representation that is always accurate as it means 
that all the non-linearities of the manifold are well represented by only one mapping function. Therefore, 
instead of using just one global scheme, it is often preferable to employ a set of simpler structures to 
approximate the manifold's geometry. This can be done in the original space where the manifold lives. 
The objective of the approximation is to create a manifold model that is as simple as possible while 
preserving the most crucial characteristic of a manifold: its shape. An example of such an approximation 



for an ID manifold is shown in Figure [Ta| where a set of lines represents the spiral shape. 

In this paper, we approximate generic manifolds with the simplest possible models, the affine subspaces 
(flats). Such a choice is motivated by the locally linear character of a manifold as well as the simplicity 
and efficiency of flats for performing local computations like projections. Our objective is to compute 
a set of low dimensional flats that represents the data as accurately as possible and at the same time 
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(a) Good manifold approximation example (b) Bad manifold approximation example 

Fig. 1 : Manifold approximation illustration. On the left, we have an example of a valid approximation by 
lines of a ID manifold embedded into R^. The different colors represent the different groups of samples, 
each approximated by a line. On the right, we have an example where the approximation does not align 
well with the manifold structure, as a result of the median k- flats algorithm [i5i 

preserves the geometry of the underlying manifold. We formulate the manifold approximation problem 
as a constrained clustering problem for manifold samples. The constraints are related to the underlying 
geometry of the manifold, which is expressed by the neighborhood graph of the data samples. We borrow 
elements of the constrained clustering theory to motivate the use of a greedy approximation scheme. For 
choosing our optimization function, we relate the capability of a set of points to be represented by a 
flat, with the variance of the tangents at these points. Then, we use the difference of tangents to uncover 
groups of points that comply with the low dimensionality of flats. The partitioning is done in a bottom- 
up manner where each manifold sample is considered as a different group at the beginning. Groups are 
then iteratively merged until their number reduces to the desired value. We have tested our algorithm on 
both synthetic and real data where it gives a superior performance compared to state-of-the-art manifold 
approximation techniques. 
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The rest of the paper is organized as follows. In Section |lT| we discuss the related work in manifold 
approximation and other relevant fields like manifold learning and hybrid linear modeling. In Section 
Till we motivate the use of a greedy approximation strategy with concepts from constrained clustering 



theory and we present our problem formulation. We present in Section IV our approximation algorithm 



in detail. In Section [V} we describe the experimental setup and the results of our experiments. Finally, 



in Section VI we provide concluding remarks. 



II. Related Work 

Data representation with affine models has received quite some attention lately. Relative approaches 
usually fall under the name of either subspace clustering or hybrid linear modeling. Their objective is to 
find a set of affine models explaining the different data sources, i.e., to cluster the data into groups so 
that each group can be represented well by a low-dimensional affine space. A common approach is to 
use an iterative scheme to alternate between steps of data segmentation and subspace estimation aiming 
at either minimizing the sum of reconstruction errors IS, ||6| or maximizing the likelihood of the data 
under a probabilistic model, like probabilistic PCA Q. Alternatively, different kinds of algebro-geometric 
approaches have also been proposed. An interesting formulation has been presented in 181, where the 
problem of subspace clustering is transformed into a problem of fitting and manipulating polynomials. 
Moreover, in O, lITOll , the spectral analysis of an appropriately defined similarity matrix over the data 
is used to uncover the underlying low dimensional structures as well as the partition that favors them. 
Recently, in ifTTIl . the use of spectral analysis is combined with a multiscale analysis of the rate of 
growth of the local neighborhoods' eigenvalues, so that, apart from the appropriate clustering, the model 
parameters, number and dimensionality of the subspaces, are simultaneously recovered from the data. 
While they are quite successful at times, the above methods apply mainly to cases where data is generated 
from different low dimensional subspaces that do not necessarily form a manifold. And as such, they 
uncover a set of linear spaces that do not necessarily comply with the manifold structure, such as the set 



of lines shown in Figure lb 



As far as manifold-driven data is concerned, there is a great variety of works in the so called field 
of manifold learning and dimensionality reduction. The goal of manifold learning is to devise a low 
dimensional, global parametrization for data sets that lie on high dimensional non-linear manifolds, while 
preserving some properties of the underlying manifold. Two pioneer works in the field are the Isomap 
Q and the LLE algorithms |fT2l. In Isomap, the parametrization is uncovered in a way that preserves the 
geodesic distances between the points while in LLE the focus is on preserving the local linear properties 
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of neighborhoods. Other well known approaches that aim at preserving local properties of the points' 
neighborhoods are provided by the Laplacian Eigenmaps (LE) |[T3]| and the Hessian Eigenmaps (HLLE) 
|[T4l . Recently, these methods have been extended to points lying on Riemmanian manifolds as opposed 
to Euclidean spaces [2]. This opens the range of possible applications for manifold-based representations. 
A detailed list of the most popular algorithms for manifold learning can be found in f[5l and |[T6l , along 
with interesting comments on their relative strengths and weaknesses. 

In manifold approximation the goal is however, to represent the manifold structure in the original 
space. The ultimate target is not a global parametrization, but rather a set of local, affine subspaces that 
could approximate the original geometry accurately. Although the locally linear nature of manifolds has 
been used as a tool for learning a global parametrization by aligning or combining local probabilistic 
data models (e.g., ifTTll , ifTS^I), only a few works so far have tried to create a model of the manifold in 
the original space while preserving its structural properties. Two such examples are the works of Wang 
and Chen ||3l and Fan and Yeung |[T9]| . In |[3l, the authors introduce the Hierarchical Divisive Clustering 
(HDC) algorithm, which is a method for hierarchically partitioning the data by dividing highly non-linear 
clusters. As a linearity measure, it uses the deviation between the euclidean and geodesic distances. In 
|[T9l , the clustering is performed in a bottom-up manner, named Hierarchical Agglomerative Clustering 
(HAC), where again the geodesic distances are used to express the underlying manifold structure. 

In our work, we have chosen a bottom-up approach but we use a different linearity measure, namely 
the variance of the tangent spaces. As it will be shown in the next section, this measure emerges naturally 
from the definition of the local properties of a manifold while both linearity measures in |[T9ll and O 
are more simplistic. In fact, the importance of the tangent spaces for manifold related tasks has been 
recognized by many researchers. In |[20ll the authors use the tangent spaces to infer valid parametrizations 
of a manifold, and recently, in (2l\ the authors focus on the reliable estimation of the tangent spaces 
from the data. They also incorporate the tangent distance into a variation of a k-means algorithm to 
classify samples into linear groups, which is another piece of evidence that tangent distances can be used 
for identifying linear regions on manifolds. Our approach however specifically addresses the problem of 
linear manifold approximation as it is fuses nicely the tangent distances with the theory of constrained 
clustering into a simple, and yet effective clustering algorithm. 
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III. Manifold approximation Problem 

A. Framework 

We consider the problem of approximating a d-dimensional manifold Ai, embedded into R^, with 
a set of d-dimensional affine subspaces, which we call flats. The dimension d is an external parameter 
in our problem; in practice it is either specific to the application at hand or estimated a priori from 
the data. The manifold is represented by the set of samples X = {xk G R^,A: G [l,m]} and the 
undirected and symmetric neighborhood graph G;^ = G{X, E), which represents the manifold's geometry 
by connecting neighbor samples on the manifold. There exist various ways to construct E when it is 
not given a priori. We have chosen to use the k-nearest neighbor approach, i.e., we connect each sample 
in X with its k-nearest neighbors. Our objective is then to uncover a partition of X into C clusters, 
Cjc{X) = {Ci^i G so that each cluster can be represented well by a d-dimensional flat that 

respects the underlying geometry of the manifold. In order for Cjc{X) to be a valid partition of X, the 
involved clusters should not overlap and they should cover the whole set X, i.e., Cj H = 0, V z 7^ j 
and uf^^Ci = X. 

There are many different ways to partition a set into L clusters. However, in our case not all possible 
partitions of X are valid since we are interested only in partitions that respect the underlying geometry 
of the manifold. In particular, we consider the partitions whose clusters spread different regions of the 
manifold to be invalid. Although these clusters can be approximated well with flats, the resulting flats 
do not comply with the local manifold structure. Such a bad partitioning example is illustrated in Figure 



lb In order to check the compliance of a partition Cjc{X) with the manifold's shape we can use the 
graph Ga'. Then, a sufficient condition for a partition to be valid is to have clusters with connected 
subgraphs. To be more specific, each cluster's subgraph is defined as Gc, = Gx{Ci^Ei) where Ei = 
{aij e E : Xi^Xj e Ci} is the set of edges in E with both endpoints in Ci. Then, the subgraph G^ is 
connected if every pair of nodes in Ci is connected with a path in Ei. The set of all partitions that fulfill 
this condition is called tho feasible set of order C and denoted by ^^(A'). The corresponding feasibility 
predicate, ^x{Cc) = C/: G is then defined as: 



true, if Gc is connected 

^x{Cc) = ^ 0(a), where 0(a) ^{ ' (1) 

false, if Gci is not connected. 



where the symbol A stands for logical addition. 

Finally, we define the fusibility predicate i/j{Ci,Cj) that expresses the possibility of fusing clusters 
Ci and Cj. It is closely related with the feasibility predicate (j) of Eq. Q by the following property of 
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binary heredity: 

if d, Cj y^9,Cin Cj = 0, (t){Ci) A (t){Cj) and ^{d, Cj), then C^-) (2) 

This property means that the fusion of two good and related clusters should give a good cluster. In our 
case, the feasibility predicate is related to the connectivity of the clusters' graphs Gc- Therefore, we 
will allow clusters Ci and Cj to be fused only if the graph corresponding to their union, Gc.uCj, is 
connected. A sufficient condition for that is the existence of an edge between any sample in Ci and any 
sample in Cj. 

B. Tangent-based clustering 

Several data partitions are feasible, but we are interested in partitions that effectively capture the 
manifold's local geometry. In order to evaluate the 'quality' of a feasible partition C, we first need a 
criterion function P that is non-negative, distributive over the clusters in C and zero for the case of 
single-element clusters, i.e., 

P(C) = ^ p{Ci) with p{Ci) > and p{{x}) = 0, Vx E X. (3) 

The function p(Ci), which represents the distribution of P over the clusters in a partition, is non-negative 
for all clusters and zero for single-element clusters. Moreover, our goal is to uncover clusters that can 
be well-represented by d-dimensional flats. Therefore, the function p should be measuring the distance 
between a linear d-dimensional space and the manifold points in the corresponding cluster. 

According to the definition in ll22ll . a set C is a d-dimensional differentiable manifold iff 
Vx G there exist open sets V G with x and VF G as well as a one-to-one, differentiable 
function f : W ^ with continuous inverse such that 

f{w) = Mnv 

f {y) = Df{y), the Jacobian matrix of /, has rank d, G 

The function / is called a coordinate system at x. Assuming that /(a) = x, the d-rank Jacobian matrix 
Df{x) and the corresponding linear transformation /* : ^ R^ define a d-dimensional subspace of 
R^, which is the tangent space of 7W at x denoted M^. 

For some choices of x, V and W, the function / might be linear. Then, Df{a) = Df{b), Va, b G W, 
which means that the tangent spaces of all points x e M H V, seen as subspaces of Rat coincide 
when they are transferred to the same origin point. These regions can be perfectly represented by flats. 
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Therefore, the goal of our algorithm is to cluster together the samples that come from such regions on 
the manifold. 

Since our target regions are characterized by low variability of the tangent spaces, it seems appropriate 
to use a variance-based criterion function p{Ci) that measures the variance of the tangents of the samples 
in a cluster Ci, i.e., 

p{C,)= J^DUMa.M,) (4) 

where Mc- is the mean tangent over the tangents of the samples in Ci and Dt is a suitably chosen 
distance measure for the tangents. Instead of working with a set of d-dimensional subspaces that are 
positioned at point x, it is more convenient to translate all of them to the origin of R^. For the rest of 
the paper, refers to the tangent space of x translated to the origin of R^. 

We give now more details on the computation of the distance Dr. The space of all rf-dimensional 
linear subspaces of R^ is called the Grassmann manifold, denoted as Gn4 [23|. Each member of the 
GN,d can be represented by any of its bases. In our case, is actually described by such an orthonormal 
basis. A unique N x N projection matrix P = BB^, that is idempotent and of rank d, corresponds to 
each d-dimensional subspace of R^ with basis B. The set of all orthogonal projections matrices of rank 
d is called Pd^N-d and is a manifold equivalent to the Grassmann manifold. However, P^^N-d is also 
embedded in R^^^. Since R^^^ is a vector space, we can define the distance between two tangents 
Mx and My as: 

DT{Mx,My) = l^\\p^-Py\\^ = l^\\MxMj -MyM^\\F= [d - tr{M^ MyM^ M,)]^'^ (5) 

The distance Dt{Mx^ My) is called the projection metric, because it results from the projection of Gn4 
into Pd^N-d [24]. 

Finally, we can use the equivalence between GN^d and Pd^N-d to derive a computation procedure for 
the mean tangent of a cluster Ci, given in Eq. (|4]). A common definition of the mean or center of a set C 
of points in the metric space S has been given by Karcher in |[25l as the element mc G S that minimizes 
the sum of square distances V to the points x in the set, i.e., 

mc = argmin T^^{x, s) (6) 

In our case, we are given a cluster Ci, where each sample x e Ci has a tangent space Mx and a 
corresponding projection matrix Px = MxMJ. We need to compute the mean tangent M^., with 
corresponding projection matrix Pc-. Using the projection distance introduced in Eq. ([5]), Eq. ([6]) translates 
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into: 



Pc. = argmin ^ h\P^ - A\\l (7) 



, 9 



Since Pd^N-d is embedded into RnxN there is a matrix, Pc'. = XIxgC that minimizes Eq. ([7]) 
in R^^^ . However, Pc- does not have to belong to Pd^N-d- Therefore we need to project it back to 
Pd,N-d^ i-e., we need to find the matrix that minimizes 

Pc. = argmin \\P - PcMf (8) 

P^Pd,N-d 

Using the Eckart- Young theorem on low rank approximation of matrices under the Frobenius norm (261, 
we can find that the matrix the solution of Eq. ([8]) is Pc- = UU^, where U is the matrix of eigenvectors 
in the d-rank singular value decomposition of Pc,, i.e., Pc, = USdU^. The corresponding subspace on 
the Grassmann manifold is thus the one spanned by U. Therefore, Mc- = U. 

This procedure for computing the mean of a set on the Grassmann manifold is often referred to as 
an extrinsic mean computation procedure lIZTll , as it uses the equivalence between G^^d and P^^N-d to 
perform the mean computation into R^^^ instead of computing the mean into the original space. 

Equipped with the above developments, we can now formalize our manifold approximation objective 
as finding the feasible partition Cjr{X) that minimizes P, i.e., 

C£(A') = argmin P{C) = argmin 2_ \ P{Ci) (9) 

By substituting the exact form of the criterion function (|4]) in ([9]) we get the following constrained 
clustering problem: 

Cl{X) = argmin E DUM^.M,) (10) 

where is defined in ([T]), Dt is a distance measure on the Grassman manifold and Mc, is the mean 
tangent of cluster Ci. From |28|, the constrained clustering problems with the form of Eq. ^ can also 
be expressed in the form of the generalized Jensen equality f29i : 

CKX) = \ (11) 

[c2:_i(A'\C*)u{C*}, C> 1 

where 

C* = argmin {P%X \ C) + p{C)) (12) 

0CCCA' 

3Ce^c-i{^\C):CU{C}e^c{^) 

The symbol \ stands for set subtraction and U for set addition. This is a dynamic programming equation 
that may lead to polynomial time solutions under certain constraints and characteristics of the clustering 
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problem |[30ll . However, in the general case, this approach gives rise to algorithms that have exponential 
time complexity. 

An alternative way of solving the above problem is presented in [3T1. It allows for more efficient, but 
less accurate, algorithms. It is a top-down procedure, where the best clustering Cjc{X) is expressed in 
terms of the clusterings in ^c+i instead of ^c-i as in Eq. ( [TT] ). We opt for such a top-down approach 
for solving the problem in Eq. ([9]). 

In such a top-down approach, we however need a measure for comparing clusters and deciding on 
proper merging choices. Thus, we define the dissimilarity measure d : (C^, Cj) Rq" as the difference 
in the criterion function before and after the merging of two clusters, i.e., 

d{C,, Cj) = p{C, U Cj) - p{C,) - p{Cj), (13) 

assuming that the merging of any two good clusters gives always rise to a cluster with a higher score in 
terms of the criterion function. Under some mild assumptions on the relations among P, d and $ OTIl , 
we can now rewrite Eq. (|9]) as 

CliX) = [c'c+i{X) \ {Cl C-}) U {C[ U C-} (14) 

where 

{C'c+r{X),C[,C])^ argmin (P(C) + C,))) 

ip(Ci,Cj) is true 

This equation just says that, in order to find the best partition ^^^{X) we need to check all £+1 feasible 
partitions for the pair of clusters with the minimum dissimilarity, and then choose the combination that 
gives the best value for the criterion function. The partition ^^^{X) is the result of the fusion of the 
chosen pair from the selected £ + 1 partition. 

Finally, from Eq. ([14]), it is straightforward to derive a top-down greedy approximation strategy for 
the clustering problem by eliminating the search over the set $£+1 and by checking only C^^^, i.e., 

Q W = W \ {C'i,Cj}) U {Cl U C-} (15) 

where 

argmin d{Ci,Cj) 

Ci,CjGC2:+i(Af) 
ip(Ci,Cj) is true 
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IV. Greedy cluster merging for locally linear approximation 



Our manifold approximation algorithm is based on grouping the manifold samples X according to 
their tangent spaces, in order to minimize the cost function in Eq. ([10]). Our new method is divided in 
two main steps. First, we perform the necessary preprocessing steps on the samples in order to compute 
the graph Gx and the tangent spaces M^. Second, we use the graph Gx and the tangent spaces Mc's 
to greedily merge the samples into feasible clusters following Eq. ( fTS) ) until we reach a clustering with 
C components. The block diagram of the method is presented in Figure |2| . 



Sample set X 
Neighborhood parameter k 
Manifold dimensionality d 
Number of flats/clusters X 



Start with n clusters Crj 




Compute final flats F 



Partition C£ 
Flats F 



Input 




Step 1 : 
jTan.e„.spaoes 



Step 2: 

Greedy merging 



Output 



Fig. 2: The block diagram of the system. 



A. Tangent space 

In the first step of the algorithm, our objective is to compute the neighborhood graph Gx and the 
tangent spaces Mc, one for each sample x ^ X. The neighborhood graph Gx = E) is computed by 
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connecting every sample x to its k nearest neighbors. The resulting graph G^' is assumed to be undirected 
and symmetric. For each sample x we can then define a neighborhood Nx = {y ^ X : (x, y) G E} as the 
set of samples that are connected to x by an edge in Gp^. Then, we can approximate the tangent space at 
X by the d-dimensional subspace of that best approximates the data in Nx. Equivalently, we compute 
Mx as the d-dimensional subspace of that best approximates the neighborhood i.e., Nx shifted 
to the origirQ In other words, we need to compute the best d-rank approximation of the data matrix 
corresponding to N^, denoted as [N^]. Based on Eckart- Young theorem ll26ll , this approximation is equal 
to the d-rank SVD of [N^]. Therefore, the tangent space Mx corresponds to the subspace spanned by 
the eigenvectors of the d largest eigenvalues of [N^]. 



B. Greedy merging 

Once the graph and the tangent spaces are computed, we proceed with solving the optimization problem 
presented in Eq. ( fTO] ). In order to minimize the cost function, we follow the method presented in Eq. 
( fTS) ). We start with n = separate clusters, one for each sample. This is the optimal clustering for n 
clusters, i.e., C* = {{x},x G X}. Then, we reduce the number of clusters iteratively, by merging the 
clusters Gi and Gj with the minimum dissimilarity, until we reach the desired number of clusters C. 

At each iteration, there exists a set of possible mergings between the clusters in C. The fusibility 
predicate defined in Eq. ([2]) defines the sufficient condition for a merging to be feasible: any cluster Gi 
can be merged with any of its neighbors, i.e., the set NGc, = {Gj : 3x G C^, 3y E Gj s.t (x, y) G E}. 
The dissimilarity between Gi and Gj E NGd is given by Eq. ( [T3] ) and Eq. ^ as 



xeCiUCj xeCi 



J2 dUm,,Mc^) 



xeCi xeCi 
+ J2 dUMx, Mc,uc, )-J2^TiTk, Mc, ) (16) 

x^Cj x^Cj 

^We apply a shift operator to the whole neighborhood Nx, where x is the vector corresponding to the sample x in R^. 
This operator moves x to the origin and brings along all its neighorhood, while preserving all distances in it. 
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Note that since Mc^ and Mcj are the mean tangents of Ci and Cj respectively, they minimize the 
sum of the square distances from the tangents in each cluster (see Eq. ([6])). In other words, every 
other d-dimensional subspace would produce a higher value in the criterion function. As a result, 
J2xeC- ^r(Mc, ^c,uCj) is greater or equal to J2xeC- ^r(Mc,^cJ- The same holds for the cluster 
Cj. Therefore, d{Ci^Cj) is always non-negative. 

However, it is costly to make all the computations of Eq. ([16]) for all feasible mergings. We rather 
use a measure that depends only on the information that is already available to the algorithm, i.e., the 
centers of the clusters that we have computed so far and their distances to the samples in their clusters. 
Moreover, since we are using a greedy top-down approach with an initial cost equal to zero, we have 
to ensure that, at each iteration of the algorithm, the chosen merging does only marginally increase the 
overall cost. Therefore, an upper bound for d{Ci^Cj) that depends only on the means of the existing 
clusters is a suitable alternative measure for our algorithm. It contributes to reducing the complexity of 
the algorithm while controling the amount of additional cost introduced at each iteration. 

First, we observe that: 



which means that the mean tangent of Ci U Cj is closer to the mean tangent of Ci than the mean 
tangent of Cj. This statement, which also holds if we interchange the clusters Ci and Cj, is inevitably 
true. Indeed, by contradiction, if ^^^^ D'^{Mx^Mc-uCj) is larger than ^^^^ M^J, then 




(17) 



J2xeCiUCj D'^{Mx,Mc^uCj) is also strictly larger than J^xeauc, ^^^{Mx.Mc^). But, this contradicts 
the optimal character of Mc^uCj for representing Ci U Cj in terms of the projection distance. 



Then, by substituting Eq. ( fTTl ), and its equivalent form for Cj in Eq. ( [T6| ), we have: 



rf(a,C,)< [dUMx^Mc^) - D^Mx^McJ] 



xeCi 




(18) 



Moreover, by the triangle inequality: 



Dt{M,, Mc^ )<Dt{M,, Mc, ) + Dt{Mc„ Mc^ ) 



(19) 



DriM^, Mc, )<DT{M^,Ma) + DT {M^ , Mc, ) 



(20) 
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Taking the square of these inequalities and summing over Cj and Ci respectively we get: 



J2 [Dl{M,,Mc^) - Dl{M^,Mc^)\ <2Dt{Mc^^Mc^) Dt{M,^Mc^) + \Cj\dI{Mc^,Mc^ 



J2 [dUM.,Mc^) - DUM.,Ma)] < 2Dt{Mc^,Mc,) J] DriM^Ma) + \Ci\DUMc., Mc^ 

xeCi x^d 



Substituting Eq. ( [2T] ) in Eq. ([18]) we finally have the following dissimilarity measure: 

d{Ci,Cj) < i\Ci\ + \Cj\)DUMa,Mc,) 



(21) 



(22) 



J2 DTiM,,Ma) + Yl Dt{M,.Mc;, 

xeCi xeCi 



^2Dt{Mc^,Mc;^ 



which depends only on pre-computed information. By comparing Eq. ([22]) with Eq. ( [16] ), we can observe 
that Eq. ( [22] ) is indeed more computationally efficient as it involves only the means of the existing clusters 
and not those of the clusters after merging the fusible pairs. In our algorithm, the costs for all possible 
mergings at each iteration are thus computed according to Eq. ([22]). The clusters with the minimum 
estimated merging cost are then combined and the mean of the newly formed cluster is computed as 



shown in Section [III-B| The procedure is then repeated until we reach the desired number of clusters C. 

At the end, each cluster represents a group of samples that can be well approximated by a d-dimensional 
flat. We compute the final flats for each cluster and we use the subspace spanned by the eigenvectors 
corresponding to the d largest eigenvalues of each cluster's data matrix as representative subspace. The 
overall manifold approximation algorithm is summarized in Algorithm 1. 



C. Computational complexity 

We analyze here briefly the complexity of our approximation algorithm. Computing the cost of a 
possible merging with Eq. ( [22] ) requires only the computation of one additional tangent distance at each 
step. Denoting by A^^-A the number of possible mergings in the clustering C^_a, the complexity of one 
step of the greedy merging (line 9 in Algorithm 1) requires n + Kn-x computations of tangent distances, 
where n = \X\ is the initial number of clusters. Therefore, the greedy merging (lines 8-12 in Algorithm 
1) will be performed in O (^^Zi{n + Kn_x)^ time. 

We now estimate the number of possible mergings Kn-x- Since, at each step of the algorithm, we 
perform one merging operation, we will have exactly n — X clusters at step A. Moreover, each Ci G C^_a 
can have a maximum size of A+ 1, and therefore we have that Ci has at most A:(A + 1) different neighbors. 
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Algorithm 1 Agglomerative clustering based on differences of tangents (ACDT) 
Input: X,k,C,d 

Step 1 * Preprocessing * 

1: Construct G[X^E) by connecting each element in X with its fc-nearest neighbors. 

2: for dUl X e X do 

3: Nx = {y ^ X : [x^y) ^ E} * Compute neighborhoods * 

where [A^^] is the data matrix formed by the elements in Nx shifted to the origin of and [/, 5, V 
are the results of its d-rank SVD. 
5: Mx = U * Compute tangent spaces * 

6: end for 

Step 2 * Greedy computation of partition * 

7: n = lA^I, A = 0, C* = {{x} : x G Af} * Initialization * 

8: for A < n — >C do * Greedy merging * 

9: (C-,C-)= argmin d{Ci,Cj) 

ip{Ci,Cj) is true 

10: c;_,^i = (c;_, \ {cl,c;}) u {c: u c^} 

11: A = A + 1 
12: end for 

13: for Ci G do * Compute the final flats Fi * 

14: [C:J,J = USV^ 

where rrii is the sample mean of [C^J is the data matrix formed by the samples in Ci shifted 

by rrii and [/, V are the results of its d-rank SVD. 
15: Fi = U 
16: end for 
Output: C^.F 
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Thus, the number of possible mergings is at most Kx < ^^{X + 1)|C^_a| = ^)(^ ~ Thai 

means that the running time T{n) of the algorithm is of the following order: 

n—jC / 1 \ /n—jC r ^ 

r(n) = J^O (n + -/c(A + l)(n- A) ) = O I n + -fc(A + l)(n - A) 
A=i ^ ^ ^ Va=i L 2 

/n—C \ 

= O ( n^) (23) 

In fact, this is a rather loose bound. If one of the clusters Ci G Cn-x has a size A, then all the other 
clusters have a size of 1. Therefore, it is better to use the average size of a cluster when computing 

~ 77/ 1 77/ 1 

the number of different neighbors, which is equal to \Ci\ = rp^- Then, Kx < oICaI^t;^;;^ = -kn. 

\Cx\ 2 |Ca| 2 

Therefore, on average, the running time of the algorithm is equal to T{n) = O (n^)- 

In comparison, if we were using the exact formula for the dissimilarity measure in Eq. ([16]), we would 
need to do one additional SVD computation and \Ci U Cj\ additional distance computations for each 
possible merging in C^. Even if we assume that the computational cost of the SVD decomposition is equal 
or equivalent to that of a tangent distance computation, one step of the merging algorithm would require 

77/ 77/ 

77/ + (1 + |Ci U Cj\)Kx new computations. Using the average estimate for \Ci U Cj\ 



\Cn-x\ 77/ -A' 

we can estimate the running time for the greedy merging to be O ( X^A^f + ) • This 

\ 77/ /\. Zi I 

means that the average running time of the algorithm with the exact dissimilarity measure is Tsvuip) = 
O ^^^^^ ZlA^f — ^ ^ {^\^'^{Jin-\ — HjC-i)^ = O {v? In 77/) which is higher than the average 
running time of our approximate algorithn|^ 

V. Experimental results 

We have conducted two different sets of experiments to study the performance of our manifold 
approximation scheme. In the first one, we have tested the performance in approximating the manifold 
data for both synthetic and real datasets. In the second one, we have studied the use of flats for handwritten 
digit classification in a simple distance-based classification scheme with the MNIST dataset |[32ll . 

A. Manifold Data approximation 

We compare our scheme with two other manifold approximation approaches from the literature, namely 
the Hierarchical Divisive Clustering (HDC) |3 | and the Hierarchical Agglomerative Clustering (HAC) 
|[T9l . The HDC algorithm starts with considering all the data as one cluster and then hierarchically 

^ Hn , He are the harmonic numbers of order n and C respectively 
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partitions them by dividing highly non-linear clusters. As a linearity measure, it uses the deviation 
between the Euclidean and geodesic distances, i.e., each cluster gets a nonlinearity score that is equal 
to the average ratio of geodesic over Euclidean distances for all the pairs of samples in the cluster. The 
process continues until all existing clusters have a nonlinearity score that is lower than a given threshold. 
On the other hand, HAC is a bottom-up algorithm, i.e., each sample is considered at the beginning 
as a separate cluster and then clusters are merged iteratively until their number reduces to the desired 
target. At each iteration of the algorithm, the pair of clusters with the minimum distance is merged. 
The distance between two clusters is measured as the average geodesic distance between the samples of 
the one cluster and the samples of the other. Our scheme follows also a bottom-up strategy; however 
our distance measure is completely different than the one in |[T9ll . The results of our tests for all three 
algorithms are following. 

1) Synthetic Data: Firstly, we test the performance of our scheme in approximating synthetic mani- 
folds. We use the Swiss roll and the S-curve dataset. The training set for both cases consists of 5000 points, 
randomly sampled from the manifolds. The neighborhood size k is set equal to 15 in the experiments. It 
is preferable to use low values for fc, varying from 0.5% to 2% of the total number of samples, in order 
to avoid "short-circuit" effects that distort the manifold structure. In order to quantify the performance, 
we use the mean squared reconstruction error (MSRE) defined as MS RE = Yl^=i \ ^iW^ where 
Xi and Xi is respectively a sample and its projection on the corresponding approximating flat, and N is 
the total number of signals. 

The MSRE versus the number of flats, for our synthetic manifolds, is presented in Figure |3] The 
results are averaged over 10 randomly chosen training sets. From Figure [3j we can see that our scheme 
approximates better the manifold structure than the other approaches. The approximation performance 
is better even for a small number of flats but the differences are more evident in the mid-range cases 
where the number of flats is between 15 and 30. For higher number of flats, the difference decreases and 
stabilizes around 50 to 60 flats when the MSREs of the algorithms converge. The effectiveness of our 
method is mainly due to the use of the difference of tangents for measuring the linearity of sample sets 
instead of the geodesic -based criteria used by other algorithms ||3l, llT9]l . Finally, an example of the final 
groups computed by our algorithm is shown in Figure |4] for the case of 12 flats. In this figure, we see that 
the structure of the manifold is correctly preserved by the proposed manifold approximation algorithm. 

2) Natural patches: We have also tested the performance of our scheme in approximating natural 
image patches since they may belong to a manifold in some applications |33|. The manifold samples are 
taken from the training set of the Berkeley Segmentation Dataset (BSDS) [34|. Each patch is of size 8x8 
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Number of flats 



10 20 30 40 50 60 

Number of flats 



(a) Swiss roll 



(b) S -curve 



Fig. 3: Mean squared reconstruction error (MSRE) versus the number of flats. The error on the y-axis is 
shown in logarithmic scale. 




(a) Swiss roll (b) S-curve 

Fig. 4: The final groups formed by the proposed approximation algorithm with 12 flats. Each represents 
a different cluster of points. 

and it captures a square region of a natural image. Before approximating the manifold, we preprocess 
the patches so that they have zero mean and unit variance. For constructing the manifold we use 10,000 
patches and k is set equal to 100. 

The approximation performance (in terms of the MSRE) versus the number of flats is presented in 
Figure |5j We have plotted the approximation error for three different choices of the flats' dimensionality. 
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i.e., d = 16, 32 and 60 respectively. We can see that, in all cases, our scheme approximates significantly 
better the manifold structure than the other approaches. This time, the performance is higher for the whole 
range of the number of flats. As the number of flats approaches 400, the performance of the algorithms 
stabilizes especially for the case of d = 32 and 60 dimensions. 



Natural Patches, 1 6 dim Natural Patches, 32 dim 




Number of flats Number of flats 



(a) 16-dimensional flats (b) 32-dimensional flats 



— e— HOC 




50 100 150 200 250 300 350 400 

Number of flats 

(c) 60-dimensional flats 

Fig. 5: MSRE for natural patches for different choices of the flats' dimensionality. The error on the y-axis 
is shown in logarithmic scale. 



B. Classification 

We finally check the application of our flat-based approximation to classification problems. We have 
built a simple classification scheme. Assuming that we have m datasets, each belonging to a different 
class, we run at first our scheme for approximating the underlying manifold of each class with n flats. 
We denote the set of resulting flats by S. Then, for each sample, we create amxn dimensional vector of 
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Feature space 


Original space 


PCA space 


95.2 % 


96.8 % 


92 % 



TABLE I: Classification performance for MNIST dataset with 10 flats per manifold. 



features, where each entry corresponds to the distance of a sample to a flat in S. Each new unclassified 
sample can now be projected to this flat-based feature space by the same procedure. Finally, we perform 
classification based on nearest neighbor criteria in this flat-based feature space. 

We have checked the performance of classification for the MNIST dataset. For each digit we use 2000 
random samples to construct the manifold and n = 10 flats to approximate it. The number of neighbors k 
is set equal to 20 and the dimensionality of each flat is d = 4. For the testing we use 1000 new samples 
from each class. We have found that the flat-based features yield a classification rate that is comparable 
to that of the nearest-neighbor classification in the original space and better than the rate achieved by 
PCA. The results are shown in Table |l| Therefore, we can say that the flats uncovered by our manifold 
approximation scheme manage to capture and preserve the crucial characteristics of the manifolds that 
could be used to discriminate samples in a space of reduced dimensionality (in our case, a space of 100 
instead of 784). Finally, there is certainly a lot of space for improvement for such kind of applications 
by explicitly enhancing the discriminative power of the flat-based representation during the manifold 
approximation step. 

VL Conclusion 

We have presented a new greedy algorithm for approximating a manifold with low dimensional flats 
based on the difference of tangent spaces. Our method is shown to be quite powerful for manifold 
approximation where it outperforms state-of-the-art manifold approximation approaches. The final low- 
dimensional representation of signals from the manifold can be used for data compression or signal 
classification. In the future, we will explore ways to uncover manifold approximations that are especially 
useful for classification. We will also extend our method to other problems like image denoising and 
restoration, manifold to manifold distance computations as well as geodesic distance computations on 
manifolds. 
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